Instructions

Part One: Definitions

Write a definition for each term below. Use complete sentences.

Question 1: Corpus

A corpus is a collection of text, stored as an object that contains many strings and is annotated with metadata. The metadata can include information such as which document each string comes from.

Question 2: Stop words

Stop words are a list of words that are excluded from the analysis. Words like “the”, “and”, and “I” are very common in all forms of writing, and by removing them we can get a better look at which distinctive words are actually common within the texts.

Question 3: Token

A token is a piece of a larger collection of information, such as an individual word taken out of a string. For example, if we take a speech and split it up into individual words, the speech has been tokenized and each word is a token.
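As a small illustration of tokenizing, the sketch below splits a made-up sentence into lowercase word tokens using base R only. The sentence and the variable names are assumptions for illustration, and the splitting rule is a simplification rather than the exact rule used by unnest_tokens later in this lab.

```r
# Hypothetical example sentence (not taken from the lab data)
speech <- "We the people, in order to form a more perfect union."

# Lowercase the text, then split on runs of characters that are
# neither letters nor apostrophes
tokens <- strsplit(tolower(speech), "[^a-z']+")[[1]]

# Drop any empty strings left over by the split
tokens <- tokens[tokens != ""]

print(tokens)
```

Each element of tokens is one token, so length(tokens) gives the number of words in the sentence.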

Question 4: Word cloud

A word cloud is a visualization that is a grouping of words with their size representing their frequency. If the word “Data” is the most common word in a dataset then it will be the largest word in the visualization.

Part Two: Learn about the Data

Question 5

When a new president is elected in the United States, he or she gives a speech at the inauguration ceremony. The data you will be using in this lab contains the text of inaugural addresses from all of the past US Presidents.

The data files you are using in this lab came from the website Kaggle.com. Navigate to the links below and read the documentation for the two datasets.

Write a few sentences below describing both datasets in your own words:

The Presidential Inaugural Speeches dataset contains data on each inaugural speech given by a president of the United States. Each entry in the dataset includes the president's name, which address it was (for example, the first inaugural address or just the inaugural address), the date they gave the speech, and the full transcript of the speech.

The US Presidents dataset, by contrast, has just one entry per president, along with some attributes associated with them: the start and end dates of their term, the president's name, what their occupation was before becoming president, their party, and who their vice president was.

Question 6

The data files need to be loaded and manipulated a bit to prepare them for later analysis in this lab. Run the R code Dr. Rader has provided in the code chunk below to prepare the data for analysis. Make sure you have downloaded the two data files from D2L and saved them in the same folder as this file.

## Parsed with column specification:
## cols(
##   Name = col_character(),
##   `Inaugural Address` = col_character(),
##   Date = col_character(),
##   text = col_character()
## )
## Parsed with column specification:
## cols(
##   start = col_character(),
##   end = col_character(),
##   president = col_character(),
##   prior = col_character(),
##   party = col_character(),
##   vice = col_character()
## )

Question 7

The two data files have been combined into one data table. Answer each question below about the data table speeches. Use whatever R commands you need to in order to get the answers. Not all of the questions require you to write code.

speeches
  • What is each row in the data table?

Each row in the new tibble is one line of text from a speech (Inaugural Address) that a president gave in the past.

  • What are each of the columns in the data table? Include the name of each column below, and a description, including the type of data in each column.

The columns in the data table are the variables named in the statement “select(president, title, party, line, text)”. These columns are:

  • president - president who gave the speech
  • title - what kind of address it was
  • party - supporting party of the president
  • line - row number
  • text - transcript of the speech

  • What are the dimensions of the data table?

The data table after manipulation has 5715 rows and 5 columns (variables).

Part Three: Tidy Text

Question 8

Describe what “tidy” text is in your own words:

Question 9

In the empty R code chunk below, transform the data table speeches into a “tidy” text table, and save it as an object called tidyspeeches. Use the unnest_tokens function. The first argument of unnest_tokens() is what you want the new column containing the words to be called (you should call it word), and the second argument is which column has the text in it that will be broken up into tokens. Note: there is an example of how to do this in the Week 15 lecture R Markdown file, in lines 51-52.

tidyspeeches = speeches %>%
  unnest_tokens(word, text)
tidyspeeches

Question 10

  • The code you wrote in the previous question transformed the speeches data table. Describe the changes that you made to the data table, in your own words:

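We tokenized the text of the speeches into individual words, so each row of the table now holds a single word in a column called word rather than a longer piece of text.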

  • What are the dimensions of the new tidyspeeches data table, after you created it in Question 9?

The new data table tidyspeeches has 140036 rows and 5 columns (variables).
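To see why the row count grows while the column count stays the same, here is a minimal sketch on a made-up two-row table. The table toy and its contents are assumptions for illustration; only unnest_tokens itself comes from the lab.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-row table standing in for speeches (made-up text)
toy <- tibble(
  president = c("A", "B"),
  line = 1:2,
  text = c("Fellow citizens of the Senate", "We observe today a victory")
)

# One row per word; the text column is replaced by the new word column,
# and the other columns are repeated for every word
toy_tidy <- toy %>%
  unnest_tokens(word, text)

dim(toy)       # 2 rows, 3 columns
dim(toy_tidy)  # 10 rows, 3 columns
```

By default unnest_tokens also lowercases the words and strips punctuation, which is why "Fellow" appears as "fellow" in toy_tidy.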

Question 11

In the empty R code chunk below, use the anti_join function along with the stopwords list included in the tidytext library to remove the stop words from the tidyspeeches data table. Save the output of this command back to the tidyspeeches object. Note: there is an example of how to do this in the Week 15 lecture R Markdown file, in line 106.

tidyspeeches = tidyspeeches %>%
  mutate(line = row_number()) %>% 
  anti_join(stop_words)      # stop_words is an object in tidytext
## Joining, by = "word"
tidyspeeches
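A minimal sketch of what anti_join does in this step, using a made-up five-word table. The names toy_words and toy_clean are hypothetical; stop_words is the real object from tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical miniature of tidyspeeches (made-up rows)
toy_words <- tibble(word = c("the", "government", "of", "the", "people"))

# Keep only the rows whose word has no match in stop_words
toy_clean <- toy_words %>%
  anti_join(stop_words, by = "word")

toy_clean$word
```

Passing by = "word" explicitly performs the same join as above but without the "Joining, by" message.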

Question 12

Next, look at the overall word frequencies in the inaugural speech data. Run the code in the code chunk below:

tidyspeeches %>% count(word) %>% arrange(desc(n))

Notice that it looks like the most common words are “government” and “people”. “Country” is up there too. There are also words like “america” and “american”. It seems like there are some words like these that we should probably filter out of the data table, because they are so common across the inauguration speeches.

In an R code chunk below, write a dplyr command to filter out the following words:

government, people, world, country, nation, america, american, americans, united, constitution, citizens

Don’t forget to save the result of the command back to the tidyspeeches object.

speeches_stopwords <- tibble(word = c("government", "people", "world", "country", "nation", "america", "american", "americans", "united", "constitution", "citizens"))

tidyspeeches = tidyspeeches %>%
  anti_join(speeches_stopwords)      # speeches_stopwords is the custom tibble defined above
## Joining, by = "word"
tidyspeeches
tidyspeeches %>% count(word) %>% arrange(desc(n))
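The same removal can also be written with filter(), which may be closer to what the question asks for. This is a sketch on made-up data; toy_words, drop_words, and toy_filtered are hypothetical names, and drop_words is a shortened version of the list in the question.

```r
library(dplyr)

# Hypothetical miniature of tidyspeeches (made-up rows)
toy_words <- tibble(word = c("government", "liberty", "people", "union", "nation"))

# Words to drop (shortened list for illustration)
drop_words <- c("government", "people", "nation")

# Keep only the rows whose word is NOT in drop_words
toy_filtered <- toy_words %>%
  filter(!word %in% drop_words)

toy_filtered$word
```

Both approaches should give the same rows; anti_join is convenient when the words to remove are already stored in a table, while filter() works directly on a character vector.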

Part Four: Word Clouds

Question 13

Starting in line 131 of the Week 15 Lecture R Markdown file, there are two examples of how to make word clouds out of “tidy” text data. Following those examples, create word clouds for the following presidential inauguration speeches:

George Washington, First Inaugural Address

tidyspeeches %>% 
  filter(president == "George Washington") %>%
  filter(title == "First Inaugural Address") %>%
  count(word) %>% 
  with(wordcloud(word, n, max.words = 100))

Abraham Lincoln, First Inaugural Address

tidyspeeches %>% 
  filter(president == "Abraham Lincoln") %>%
  filter(title == "First Inaugural Address") %>%
  count(word) %>% 
  with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): constitutional could not be
## fit on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): expressly could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): laws could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): section could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): power could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): enforced could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): property could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): purpose could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): fugitive could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): service could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): friends could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): intercourse could not be
## fit on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): congress could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): institutions could not be
## fit on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): parties could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): offices could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): contract could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): oath could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): written could not be fit on
## page. It will not be plotted.

Franklin D. Roosevelt, Third Inaugural Address

tidyspeeches %>% 
  filter(president == "Franklin D. Roosevelt") %>%
  filter(title == "Third Inaugural Address") %>%
  count(word) %>% 
  with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): democracy could not be fit
## on page. It will not be plotted.

John F. Kennedy, Inaugural Address

tidyspeeches %>% 
  filter(president == "John F. Kennedy") %>%
  filter(title == "Inaugural Address") %>%
  count(word) %>% 
  with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): pledge could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): president could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): generation could not be fit
## on page. It will not be plotted.

Barack Obama, Second Inaugural Address

tidyspeeches %>% 
  filter(president == "Barack Obama") %>%
  filter(title == "Second Inaugural Address") %>%
  count(word) %>% 
  with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): generation could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): god could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): time could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): freedom could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): meaning could not be fit on
## page. It will not be plotted.

Donald J. Trump, Inaugural Address

tidyspeeches %>% 
  filter(president == "Donald J. Trump") %>%
  filter(title == "Inaugural Address") %>%
  count(word) %>% 
  with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): millions could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): countries could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): dreams could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): borders could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): families could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): factories could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): foreign could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): obama could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): power could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): jobs could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): bring could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): heart could not be fit on
## page. It will not be plotted.

Part Five: Sentiment Analysis

Recall that sentiment analysis requires a dictionary of words and their associated “sentiment”, like positive, negative, or neutral. Fortunately, the tidytext library makes sentiment analysis very easy, because it has a couple of built-in sentiment dictionaries.

Question 14

In the empty R code chunk below, use the “bing” sentiment dictionary that is part of the tidytext library to add a sentiment column to the tidyspeeches data table. Save this as a new object called sentiments. Note: there is an example of how to do this starting in line 157 of the Week 15 lecture R Markdown file.

sentiments = tidyspeeches %>%
  inner_join(get_sentiments("bing"))
## Joining, by = "word"
sentiments

Question 15

Graph the percent of positive and negative words in each of the six inaugural speeches you made word clouds out of, above:

  • George Washington, First Inaugural Address
  • Abraham Lincoln, First Inaugural Address
  • Franklin D. Roosevelt, Third Inaugural Address
  • John F. Kennedy, Inaugural Address
  • Barack Obama, Second Inaugural Address
  • Donald J. Trump, Inaugural Address

The beginning of the R command is included for you below. You need to finish it. Make sure that there is one facet per president. Note: there is an example of how to do this starting in line 183 of the Week 15 lecture R Markdown file.

sentiments %>% 
  filter((president == "George Washington" & title == "First Inaugural Address") | 
           (president == "Abraham Lincoln" & title == "First Inaugural Address") | 
           (president == "Franklin D. Roosevelt" & title == "Third Inaugural Address") | 
           (president == "John F. Kennedy" & title == "Inaugural Address") | 
           (president == "Barack Obama" & title == "Second Inaugural Address") | 
           (president == "Donald J. Trump" & title == "Inaugural Address")) %>% 
  group_by(president, sentiment) %>%
  # summarize() drops the sentiment grouping, so sum(n) below is the total per president
  summarize(n = n()) %>% 
  mutate(percent = n / sum(n) * 100) %>% 
  ggplot(aes(x = sentiment, y = percent, fill = sentiment)) + 
    geom_bar(stat = "identity") +
    facet_wrap(~president) +
    labs(x = "Sentiment", 
      y = "Percentage of Positive and Negative Words", 
      title = "Proportions of Positive and Negative Words in Presidential Inaugural Speeches") 
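The percent calculation relies on how grouping behaves after summarize(), which is easy to miss. The sketch below repeats the group_by, summarize, and mutate steps on a made-up table. toy_sentiments and toy_percent are hypothetical names, and .groups = "drop_last" simply makes the default behavior explicit (it assumes dplyr 1.0.0 or later).

```r
library(dplyr)

# Hypothetical sentiment table: president A has 3 positive and 1 negative
# word, president B has 1 of each
toy_sentiments <- tibble(
  president = c("A", "A", "A", "A", "B", "B"),
  sentiment = c("positive", "positive", "positive", "negative",
                "positive", "negative")
)

toy_percent <- toy_sentiments %>%
  group_by(president, sentiment) %>%
  # After this step the table is still grouped by president
  summarize(n = n(), .groups = "drop_last") %>%
  # sum(n) is therefore the word total for each president, not overall
  mutate(percent = n / sum(n) * 100)

toy_percent
```

Because the percentages are computed within each president, they add up to 100 in every facet of the graph.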

Extra credit: TF-IDF

Starting on line 306 in the Week 15 lecture R Markdown file, there is an example of how to calculate TF-IDF for a corpus, and then graph the high TF-IDF words in each of six Shakespeare plays.

Following this example, calculate TF-IDF for six inaugural speeches from the tidyspeeches data table. You can use the same six as above or explore other speeches; it is your choice. Because titles are not unique identifiers of speeches, the data table needs a unique identifier for each speech. This command is started for you below, including the code to create a unique identifier for each speech, speech.

speeches_tfidf <- tidyspeeches %>% 
  mutate(speech = paste(president, title))

Then, following the example starting in line 333 in the Week 15 lecture R Markdown file, make a faceted bar graph like the one in that file showing the high TF-IDF words in each of the six speeches you have selected to graph. This is started for you below (assuming you want to use the same six speeches as above).

tfidf_ranks <- speeches_tfidf %>%
  filter(speech %in% c("George Washington First Inaugural Address", 
                       "Abraham Lincoln First Inaugural Address", 
                       "Franklin D. Roosevelt Third Inaugural Address", 
                       "John F. Kennedy Inaugural Address", 
                       "Barack Obama Second Inaugural Address", 
                       "Donald J. Trump Inaugural Address")) %>% 
  group_by(speech)
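The commands above stop before TF-IDF is actually calculated. As a rough sketch of the remaining calculation, the example below counts words per speech and then calls bind_tf_idf from tidytext on a made-up table. The data and the names toy_tokens and toy_tfidf are assumptions for illustration, and this is not necessarily the approach used in the lecture file.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens from two made-up speeches
toy_tokens <- tibble(
  speech = c("A", "A", "A", "A", "B", "B", "B", "B"),
  word   = c("union", "union", "liberty", "liberty",
             "union", "union", "union", "jobs")
)

toy_tfidf <- toy_tokens %>%
  count(speech, word) %>%          # n = times each word appears in each speech
  bind_tf_idf(word, speech, n) %>% # adds the tf, idf, and tf_idf columns
  arrange(desc(tf_idf))

toy_tfidf
```

In this toy data "union" appears in both speeches, so its idf (and therefore its tf_idf) is zero, while "liberty" and "jobs" each appear in only one speech and rank highest. The same pattern should apply to the real speeches: words shared by every speech score low, and words distinctive to one speech score high.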